Simulation Exercise – Job Demonstration

by Barry Daemi

Wells Fargo Quantitative Analytics Program - Risk Analytics & Decision Science Track (Master's)

Southern Methodist University


November 12, 2022

$$\newcommand{\C}{\mathbb{C}} \newcommand{\R}{\mathbb{R}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\P}{\mathbb{P}} \newcommand{\F}{\mathbb{F}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}}$$
  1. Task Description:

         Financial institutions that lend to consumers rely on models to help decide on who to approve or decline for credit (for lending products such as credit, automobile loans, or home loans). In this job simulation, your task is to develop models that review credit card applications to determine which ones should be approved. You are given historical data containing one response (binary) and 20 predictor variables from credit card accounts for a hypothetical bank XYZ.

Introduction

     A large wealth of literature is available on the interwebs concerning the topics of model selection for credit application approval [1], creddit approval analysis and modelling [2,3], and model validation [2]. And in consequence there exists over-abundance of pathways of analysis and model development for credit approval; though for the purpose of the completing the aforementationed Task Description, we decided to only select two predominant supervised machine learning classifier algorithm to perform the modeling aspect of the project; we choice Logistic Regression classifer and Random forest classifier to perform the two modeling section of this project.
     In the first section, we performed the necessary data formatting and analysis on the provided dataset, to ready it for the purpose of modeling. In the second section, we develop the theoretical underpinnings of a neural network implementation of logistic regression, though we did develop our own proprietary software instead we relied upon the open-source neural network implemented logistic regression from sklearn. We follow the logistic regression model's results with statistical inference commentary. In the third section, we cover the theoretical framework for random forest, and implement said model through the open-source package sklearn. Same with the results of the logistic regression model, we following the results of the random forest model with statistical inference based commentary. Lastly in the fourth section, we decision model selection, and model use in real-world credit approval application.
     For the purpose of model replication, we used the following code block alongside our package implementation, we added print statement to the display in the console the versions of each package that was utilized.

In [1]:
import numpy as np
import scipy.special as sp
import pandas as pd
import statsmodels.api as sm
import seaborn
import matplotlib
import matplotlib.pyplot as plt
import sklearn as sk
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score 
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
print("Numpy " + str(np.__version__));
print("Pandas " + str(pd.__version__));
print("Seaborn "+str(seaborn.__version__));
print("Matplotlib.pyplot "+str(matplotlib.__version__));
print("Sklearn "+str(sk.__version__));
Numpy 1.21.5
Pandas 1.4.2
Seaborn 0.11.2
Matplotlib.pyplot 3.5.1
Sklearn 1.0.2

Just for the desire for creativity we provided the Task Description inside a center gray box. To create said gray block containing the Task description text, we required a snippet of CSS code. As Markdown compiles into HTML, and thus, requires a HTML compiler to compile and execute said code, like others, we knew Jupyter Notebook possesses a HTML compiler. And therefore, we knew we could pass on the following CSS code to the HTML compiler, which has the capacity to call the proper CSS compiler to compile and execute said CSS code [4]. An additional note is that the Markdown text blocks do not contain Markdown code, but instead straight HTML code. The purpose was to retain the full-functionality of the HTML5 as a markup language, which can heavily custom text, which can further the aestethic of a said document.

In [2]:
%%html
<style>#toc_container {
background: #f9f9f9 none repeat scroll 0 0;
border: 1px solid #aaa;
display: table;
font-size: 85%;
margin-bottom: 1em;
padding: 20px;
width: auto;
}
.toc_title {
font-weight: 700;
text-align: center;
}
#toc_container li, #toc_container ul, #toc_container ul li{
list-style: outside none none !important;
}</style>
<style>
table, th, td {
  border:1px solid black;
}
</style>

Section 1: Data Analysis

     We imported the training dataset, Training_R-197135_Candidate Attach #1_JDSE_SRF #456.csv.csv through Pandas.read_csv function; we named the imported dataframe as train_df. For reader convenience, we printed train_df to the console, so they are able observe the what was the resulted dataset.

In [3]:
train_df=pd.read_csv("Training_R-197135_Candidate Attach #1_JDSE_SRF #456.csv.csv")
train_df
Out[3]:
tot_balance avg_bal_cards credit_age credit_age_good_account credit_card_age num_acc_30d_past_due_12_months num_acc_30d_past_due_6_months num_mortgage_currently_past_due tot_amount_currently_past_due num_inq_12_month ... num_card_12_month num_auto_ 36_month uti_open_card pct_over_50_uti uti_max_credit_line pct_card_over_50_uti ind_XYZ rep_income rep_education Def_ind
0 102956.11010 14819.057400 238 104 264 0 0 0 0.000000 0 ... 1 0 0.366737 0.342183 0.513934 0.550866 0 118266.32130 college 0
1 132758.72580 18951.934550 384 197 371 0 0 0 0.000000 0 ... 0 0 0.490809 0.540671 0.418016 NaN 0 89365.05765 college 0
2 124658.91740 15347.929690 277 110 288 0 0 0 0.000000 0 ... 0 0 0.359074 0.338560 0.341627 0.451417 0 201365.12130 college 0
3 133968.53690 14050.713340 375 224 343 0 0 0 0.000000 2 ... 1 0 0.700379 0.683589 0.542940 0.607843 0 191794.48550 college 0
4 143601.80170 14858.515270 374 155 278 0 0 0 0.000000 0 ... 0 0 0.647351 0.510812 0.632934 0.573680 0 161465.36790 graduate 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19995 89665.13930 11521.159950 319 139 363 0 0 0 0.000000 0 ... 0 0 0.535628 0.634712 0.527230 0.602345 0 NaN high_school 0
19996 136211.63530 17977.054130 297 137 273 0 0 0 0.000000 2 ... 0 0 0.464774 0.450030 0.545108 NaN 1 NaN high_school 0
19997 110721.87650 13316.820540 304 151 257 0 0 0 0.000000 0 ... 0 0 0.264544 0.340289 0.412155 NaN 0 157706.15810 college 0
19998 96742.36371 11743.262370 275 141 294 2 1 1 3009.387661 0 ... 0 0 0.609226 0.582007 0.301612 0.697052 1 97387.97414 college 1
19999 107338.82070 7942.952546 325 195 302 0 0 0 0.000000 0 ... 0 0 0.358067 0.435511 0.349246 NaN 0 165447.16380 college 0

20000 rows × 21 columns

     The resulted dataset, train_df, possessed twenty-thousand observations and twenty-one variables, twenty of which were the predictor features, and while only one variable was the target variable, Def_Ind. We created the following reference table with the details of the twenty predictor features each with their own accompanied description.

tot_balance: Total balance num_card_inq_24
_month
:
Number of credit card inquiries in last 24 months
avg_bal_cards: Average balance over all active cards num_card_12_month: Number of credit cards opened in last 12 months
credit_age: Age in months of first credit product num_auto_ 36_month: Number of auto loans opened in last 36 months
credit_age_good
_account
:
Age in months of oldest credit product obtained uti_open_card: Utilization on open credit card accounts
credit_card_age: Age in months of applicant’s oldest credit card pct_over_50_uti: Percentage of open accounts with over 50% utilization
num_acc_30d_past_
due_12_months
:
Number of accounts that are 30 or more days delinquent within last 12 months uti_max_credit
_line
:
Utilization on credit account with highest credit limit
num_acc_30d_past_
due_6_months
:
Number of accounts that are 30 or more days delinquent within last 6 months pct_card_over
_50_uti
:
Percentage of open credit cards with over 50% utilization
num_mortgage_
currently_past
_due
:
Number of mortgages delinquent in last 6 months ind_XYZ: Indicator: 1 if applicant already has some account (checking/savings, etc.) with the bank XYZ; 0 otherwise
tot_amount_
currently_past
_due
:
Total amount past due currently for all credit accounts rep_income: annual income (self-reported by applicant and not verified)
num_inq_12_month: Number of inquiries in last 12 months rep_education: education level (self-reported by applicant and not verified): Four levels: high-school or below; college degree; graduate degree; other

We also created a reference table for the target variable Def_Ind.

Def_Ind: Binary: 1 = account defaulted after an account was approved and opened with bank XYZ in the past 18 months; 0 = not defaulted

The source of these variable descriptions were sourced directly from the Task description documentation provided; we hope these quick reference table prove useful to the reader.
     Throgh train_df.isna().any(), we found that the columns: pct_card_over_50_uti, rep_income and rep_education all contained missing data. With further inquiry through train_df.isna().sum(), we were able to discern that pct_card_over_50_uti had $1,958$ missing values, pct_card_over_50_uti had $1,559$ missing values and rep_education only had one missing value. This discernment is important, as the observations with missing data that are left in the training dataset will contribute to an increase in bias in eiterh the training of an logistic regression, or random forest algorithm, e.g., the missing data will contribute to a overfitted model. To prevent this increase in bias in the logistic regression, or random forest model due to the missing data; we had to delete the observations that had missing values [5]. Though deletion of observation with missing data can be the simplest and most convenient approaches to said bias problem, it can also result in an increase in bias in the model, given the column variables that are missing are not randomly distributed random variables, but instead deterministic values [6].

Deletion

In this approach all entries with missing values are removed/discarded when doing analysis. Deletion is considered the simplest approach as there is no need to try and estimate value. However, the authors of Little and Rubin [18] have demonstrated some of the weakness of deletion, as it introduce bias in analysis, especially when the missing data is not randomly distributed. The process of deletion can be carried out in two ways, pairwise or list-wise deletion [32].

Section: Missing values approaches. A survey on missing data in machine learning by Emmanuel et al. (2021) [6]

Fortuntely the each of the columns values that possessed missing values were all sampled random variables, and therefore data is randomly distributed. Though on an additional note, it would be remiss of us, if we did not mention that in Emmanuel et al. (2021) [6], that the purpose of said research article was the development of a algorithms, "k nearest neighbor and an iterative imputation method (missForest)" [6], derived from the random forest algorithm, which can handle missing data with far better success then conventional random forest; though the implication and implementation of said algorithms are outside the scope of this project - though we encourage the reader to read said article, as it seems promising in a research perspective.
     Though the theory behind duplicated data is not yet settled, we have observed suggestions, such as [7], that duplicated data can contribute to a model's varience, in otherwords, duplicated data can cause an overfitting of the model due to over emphasizing duplicated data over novel data during training. Overfitting in a mathematical term is the "loss of generality", that a solution can experience if it only applicable to a unique set of cases, in this context a unique set of datasets. With said lost of generality, said model predictive capacity is limited to the aforementationed unique set of datasets that it was fitted. As this is not the desire result, we decided to discern if duplicated data existed in train_df, through train_df2.duplicated(); foruntately, all of the observations were found to be unique.
     We implemented train_df.dtypes, as a way to observe the data types of each column; we found that rep_education was an object data type, which neither logistic regression nor random variable is able handle; as a result we converted rep_education from a object column to a categorical column through an if nested boolean tree. As no specifications were given in regards to report educational attainment, we decide through assumption that we categorize 'other' as $1$, 'high_school' as $2$, 'college' as $3$ and 'graduate' as $4$. In essence, we made the assumption that educational attianment was a tier, that it was ordered as, rep_education: other $<$ high school $<$ college $<$ graduate.
     After the following data formatting steps, train_df size changed from (20000, 21) to (16653, 21), and was renamed to train_df2.

In [4]:
print(train_df.isna().any()); # Checks for missing data in each column
print(" ");
print(train_df.isna().sum()); # The number of missing data in each column
print(" ");
print(train_df.dtypes)
print(" ")
train_df2=train_df.dropna();
a=list(train_df2['rep_education'])
b=[];

for i in range(len(a)):
    if a[i]=='Other':
        b.append(1)
    elif a[i]=='high_school':
        b.append(2)
    elif a[i]=='college':
        b.append(3)
    elif a[i]=='graduate':
        b.append(4)
    else:
        b.append(0);
        
train_df2['rep_education']=b;
print('Number of duplicate accounts: '+str(sum(train_df2.duplicated()))); # Number of duplicate obs.
print(" ");
print("Size of train_df: " +str(train_df.shape));
print("Size of train_df2 "+str(train_df2.shape));
tot_balance                        False
avg_bal_cards                      False
credit_age                         False
credit_age_good_account            False
credit_card_age                    False
num_acc_30d_past_due_12_months     False
num_acc_30d_past_due_6_months      False
num_mortgage_currently_past_due    False
tot_amount_currently_past_due      False
num_inq_12_month                   False
num_card_inq_24_month              False
num_card_12_month                  False
num_auto_ 36_month                 False
uti_open_card                      False
pct_over_50_uti                    False
uti_max_credit_line                False
pct_card_over_50_uti                True
ind_XYZ                            False
rep_income                          True
rep_education                       True
Def_ind                            False
dtype: bool
 
tot_balance                           0
avg_bal_cards                         0
credit_age                            0
credit_age_good_account               0
credit_card_age                       0
num_acc_30d_past_due_12_months        0
num_acc_30d_past_due_6_months         0
num_mortgage_currently_past_due       0
tot_amount_currently_past_due         0
num_inq_12_month                      0
num_card_inq_24_month                 0
num_card_12_month                     0
num_auto_ 36_month                    0
uti_open_card                         0
pct_over_50_uti                       0
uti_max_credit_line                   0
pct_card_over_50_uti               1958
ind_XYZ                               0
rep_income                         1559
rep_education                         1
Def_ind                               0
dtype: int64
 
tot_balance                        float64
avg_bal_cards                      float64
credit_age                           int64
credit_age_good_account              int64
credit_card_age                      int64
num_acc_30d_past_due_12_months       int64
num_acc_30d_past_due_6_months        int64
num_mortgage_currently_past_due      int64
tot_amount_currently_past_due      float64
num_inq_12_month                     int64
num_card_inq_24_month                int64
num_card_12_month                    int64
num_auto_ 36_month                   int64
uti_open_card                      float64
pct_over_50_uti                    float64
uti_max_credit_line                float64
pct_card_over_50_uti               float64
ind_XYZ                              int64
rep_income                         float64
rep_education                       object
Def_ind                              int64
dtype: object
 
Number of duplicate accounts: 0
 
Size of train_df: (20000, 21)
Size of train_df2 (16653, 21)
C:\Users\Barry\AppData\Local\Temp\ipykernel_15592\3229321580.py:23: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_df2['rep_education']=b;

     To best summarize the relationship between predictor variables, we used seaborn.pairplot(train_df2), which produced the following graphic. Note by double clicking the graphic, activates a zoom, so that the graphic become legible.

In [5]:
g=seaborn.pairplot(train_df2);
No description has been provided for this image

     Following code block attains the correlation matrix, $\text{Corr}$, of the dataset train_df2, and draws a heatmap corresponding to the correlation in each predictor feature, though the map does not display meaningless self-correlation values, that are found in the diagonal elements.

In [6]:
corr=train_df2.corr(); # Correlation matrix
mask=np.zeros_like(corr,dtype=bool); # Generate a mask for the upper triangle
mask[np.triu_indices_from(mask)]=True;

f,ax =plt.subplots(figsize=(11,9)); # Set up the matplotlib figure

# Generate a custom diverging colormap
cmap=seaborn.diverging_palette(220,10,as_cmap=True); 
# Draw the heatmap with the mask and correct aspect ratio
seaborn.heatmap(corr,mask=mask,cmap=cmap,vmax=0.5,
    linewidths=0.5,cbar_kws={"shrink": 0.5},ax=ax,);
No description has been provided for this image
In [7]:
corr
Out[7]:
tot_balance avg_bal_cards credit_age credit_age_good_account credit_card_age num_acc_30d_past_due_12_months num_acc_30d_past_due_6_months num_mortgage_currently_past_due tot_amount_currently_past_due num_inq_12_month ... num_card_12_month num_auto_ 36_month uti_open_card pct_over_50_uti uti_max_credit_line pct_card_over_50_uti ind_XYZ rep_income rep_education Def_ind
tot_balance 1.000000 0.706008 0.018683 0.008407 0.017945 -0.025365 -0.014801 -0.020003 -0.017746 -0.018269 ... -0.015882 -0.000527 -0.025588 -0.028925 -0.013396 -0.019086 -0.004480 0.002867 0.010655 -0.090389
avg_bal_cards 0.706008 1.000000 0.012409 0.008274 0.010635 -0.017284 -0.006276 -0.015438 -0.009335 -0.010216 ... -0.014330 0.001825 -0.027921 -0.027622 -0.021695 -0.020415 0.000823 -0.001124 0.006176 -0.112316
credit_age 0.018683 0.012409 1.000000 0.799485 0.851878 -0.033706 -0.021219 -0.018264 -0.028571 -0.024310 ... -0.010588 0.001013 -0.046747 -0.039747 -0.035542 -0.041107 0.008604 0.014881 0.026740 -0.101712
credit_age_good_account 0.008407 0.008274 0.799485 1.000000 0.676699 -0.031891 -0.024850 -0.022256 -0.029034 -0.017317 ... -0.013260 0.001666 -0.033019 -0.030550 -0.027794 -0.028352 0.010351 0.006632 0.024705 -0.080229
credit_card_age 0.017945 0.010635 0.851878 0.676699 1.000000 -0.025152 -0.014529 -0.014072 -0.021689 -0.021692 ... -0.002793 -0.000250 -0.048297 -0.039013 -0.041591 -0.039819 0.007888 0.015414 0.020083 -0.087758
num_acc_30d_past_due_12_months -0.025365 -0.017284 -0.033706 -0.031891 -0.025152 1.000000 0.710836 0.730372 0.807057 0.037867 ... 0.018590 0.002128 0.057994 0.042795 0.037550 0.052665 -0.023432 -0.003573 -0.023003 0.278412
num_acc_30d_past_due_6_months -0.014801 -0.006276 -0.021219 -0.024850 -0.014529 0.710836 1.000000 0.740790 0.778626 0.034995 ... 0.007625 0.005726 0.031946 0.024165 0.020911 0.022819 -0.007584 -0.009195 -0.011348 0.242955
num_mortgage_currently_past_due -0.020003 -0.015438 -0.018264 -0.022256 -0.014072 0.730372 0.740790 1.000000 0.767837 0.030150 ... 0.016962 0.002660 0.038531 0.031204 0.023445 0.033175 -0.013731 -0.016240 -0.016327 0.247359
tot_amount_currently_past_due -0.017746 -0.009335 -0.028571 -0.029034 -0.021689 0.807057 0.778626 0.767837 1.000000 0.034072 ... 0.008523 0.001563 0.037530 0.027323 0.024718 0.030772 -0.010888 -0.004745 -0.019133 0.258291
num_inq_12_month -0.018269 -0.010216 -0.024310 -0.017317 -0.021692 0.037867 0.034995 0.030150 0.034072 1.000000 ... 0.017702 0.004177 0.040801 0.030027 0.036274 0.037665 -0.037070 -0.005109 -0.024837 0.130904
num_card_inq_24_month -0.011203 -0.007526 -0.020134 -0.016543 -0.016279 0.037378 0.035194 0.032946 0.036066 0.901963 ... 0.008290 0.008430 0.037722 0.023515 0.030282 0.033351 -0.039109 -0.002521 -0.022003 0.115971
num_card_12_month -0.015882 -0.014330 -0.010588 -0.013260 -0.002793 0.018590 0.007625 0.016962 0.008523 0.017702 ... 1.000000 0.110687 0.003879 0.003308 -0.000449 0.010787 -0.009833 -0.022956 0.002878 0.028948
num_auto_ 36_month -0.000527 0.001825 0.001013 0.001666 -0.000250 0.002128 0.005726 0.002660 0.001563 0.004177 ... 0.110687 1.000000 -0.009198 -0.008404 -0.009326 -0.002606 -0.000397 -0.016421 -0.008177 0.006338
uti_open_card -0.025588 -0.027921 -0.046747 -0.033019 -0.048297 0.057994 0.031946 0.038531 0.037530 0.040801 ... 0.003879 -0.009198 1.000000 0.749143 0.749256 0.846833 -0.021142 -0.004465 -0.032851 0.209379
pct_over_50_uti -0.028925 -0.027622 -0.039747 -0.030550 -0.039013 0.042795 0.024165 0.031204 0.027323 0.030027 ... 0.003308 -0.008404 0.749143 1.000000 0.566553 0.630853 -0.019144 -0.003434 -0.027878 0.168094
uti_max_credit_line -0.013396 -0.021695 -0.035542 -0.027794 -0.041591 0.037550 0.020911 0.023445 0.024718 0.036274 ... -0.000449 -0.009326 0.749256 0.566553 1.000000 0.634889 -0.016914 -0.004515 -0.022459 0.158442
pct_card_over_50_uti -0.019086 -0.020415 -0.041107 -0.028352 -0.039819 0.052665 0.022819 0.033175 0.030772 0.037665 ... 0.010787 -0.002606 0.846833 0.630853 0.634889 1.000000 -0.016418 -0.000885 -0.033578 0.174590
ind_XYZ -0.004480 0.000823 0.008604 0.010351 0.007888 -0.023432 -0.007584 -0.013731 -0.010888 -0.037070 ... -0.009833 -0.000397 -0.021142 -0.019144 -0.016914 -0.016418 1.000000 0.006390 0.017410 -0.040863
rep_income 0.002867 -0.001124 0.014881 0.006632 0.015414 -0.003573 -0.009195 -0.016240 -0.004745 -0.005109 ... -0.022956 -0.016421 -0.004465 -0.003434 -0.004515 -0.000885 0.006390 1.000000 0.013797 -0.000740
rep_education 0.010655 0.006176 0.026740 0.024705 0.020083 -0.023003 -0.011348 -0.016327 -0.019133 -0.024837 ... 0.002878 -0.008177 -0.032851 -0.027878 -0.022459 -0.033578 0.017410 0.013797 1.000000 -0.030515
Def_ind -0.090389 -0.112316 -0.101712 -0.080229 -0.087758 0.278412 0.242955 0.247359 0.258291 0.130904 ... 0.028948 0.006338 0.209379 0.168094 0.158442 0.174590 -0.040863 -0.000740 -0.030515 1.000000

21 rows × 21 columns

     Certain pair of predictor features possessed low positive correlation with each other, such as tot_balance and avg_card_balance, which entirely make sense. Though these relationships are interesting, we are more concerned with the relationship between the predictor features and the target variable Def_ind. The following predictor features possessed a weak positive relationship to the indicator of default Def_ind; num_acc_30d_past_due_12_months, num_acc_30d_past_due_6_months, num_mortgage_currently_past_due, tot_amount_currently_past_due and num_card_12_month. While these following predictor features possessed a weak negative relationship to the default indactor; avg_bal_cards and credit_age. These observation will be important later in Section 5: How Model Improves Decision Making?.
     We imported the test dataset, Test_R-197135_Candidate Attach #2_JDSE_SRF #456.csv through Pandas.read_csv function; we named the imported test dataframe as test_df. For reader convenience, we printed test_df to the console, so they are able observe the what was the resulted dataset.

In [8]:
test_df=pd.read_csv("Test_R-197135_Candidate Attach #2_JDSE_SRF #456.csv")
test_df
Out[8]:
tot_balance avg_bal_cards credit_age credit_age_good_account credit_card_age num_acc_30d_past_due_12_months num_acc_30d_past_due_6_months num_mortgage_currently_past_due tot_amount_currently_past_due num_inq_12_month ... num_card_12_month num_auto_ 36_month uti_open_card pct_over_50_uti uti_max_credit_line pct_card_over_50_uti ind_XYZ rep_income rep_education Def_ind
0 75061.45088 11051.42462 191 103 220 0 0 0 0.000000 0 ... 0 0 0.417116 0.490809 0.400379 0.429427 1 200321.9635 high_school 0
1 89792.74848 13839.37518 140 145 152 1 0 0 0.000000 0 ... 0 1 0.472116 0.505581 0.655517 0.501279 0 168452.9762 high_school 0
2 95928.23392 10437.19476 343 220 388 2 0 0 19530.997450 0 ... 0 1 0.394099 0.551539 0.309663 0.482915 1 190633.9622 other 0
3 124957.43040 17413.10572 232 97 235 0 0 0 0.000000 0 ... 2 1 0.492846 0.540109 0.590457 0.466224 1 106712.5622 high_school 0
4 75058.13462 12326.23680 236 165 280 0 0 0 0.000000 0 ... 1 1 0.381452 0.344772 0.526555 0.345455 0 173172.1864 college 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 62927.10171 16602.57606 271 127 291 0 0 0 887.199204 0 ... 1 0 0.396578 0.519155 0.301686 NaN 1 151388.6128 college 0
4996 98348.24852 16093.02091 141 117 184 0 0 0 0.000000 0 ... 1 0 0.447068 0.523186 0.426136 0.509175 0 105431.9853 college 0
4997 49262.82310 10029.11290 321 144 367 0 0 0 0.000000 6 ... 0 0 0.476359 0.449276 0.524794 0.578619 0 110293.0904 college 0
4998 116989.85900 15803.98673 282 153 275 0 0 0 0.000000 1 ... 0 0 0.379345 0.505389 0.401324 0.497443 1 140715.4635 college 0
4999 106157.16290 15439.35593 182 68 185 0 0 0 0.000000 0 ... 1 0 0.406199 0.479625 0.424671 0.396460 0 140042.8599 college 0

5000 rows × 21 columns

     test_df is the test dataset, it containthe same predictor features as train_df and five-thousand observations. To clean the dataset for the purpose of its use to test the trained logistic regression and random forest models, we had to repeat the formatting process that we conducted on training dataset, on the test dataset.
     An additional note regarding missing data; as the eiher trained model could not handle missing data, yet again we had to data observations that had missing values.
     After the following data formatting steps, test_df size changed from (5000, 21) to (4133, 21), and was renamed to test_df2.

In [9]:
print(test_df.isna().any()); # Checks for missing data in each column
print(" ");
print(test_df.isna().sum()); # The number of missing data in each column
print(" ");
print(test_df.dtypes)
print(" ")
test_df2=test_df.dropna();
a=list(test_df2['rep_education'])
b=[];

for i in range(len(a)):
    if a[i]=='Other':
        b.append(1)
    elif a[i]=='high_school':
        b.append(2)
    elif a[i]=='college':
        b.append(3)
    elif a[i]=='graduate':
        b.append(4)
    else:
        b.append(0);
        
test_df2['rep_education']=b;
print('Number of duplicate accounts: '+str(sum(test_df2.duplicated()))); # Number of duplicate obs.
print(" ");
print("Size of train_df: " +str(test_df.shape));
print("Size of train_df2 "+str(test_df2.shape));
tot_balance                        False
avg_bal_cards                      False
credit_age                         False
credit_age_good_account            False
credit_card_age                    False
num_acc_30d_past_due_12_months     False
num_acc_30d_past_due_6_months      False
num_mortgage_currently_past_due    False
tot_amount_currently_past_due      False
num_inq_12_month                   False
num_card_inq_24_month              False
num_card_12_month                  False
num_auto_ 36_month                 False
uti_open_card                      False
pct_over_50_uti                    False
uti_max_credit_line                False
pct_card_over_50_uti                True
ind_XYZ                            False
rep_income                          True
rep_education                       True
Def_ind                            False
dtype: bool
 
tot_balance                          0
avg_bal_cards                        0
credit_age                           0
credit_age_good_account              0
credit_card_age                      0
num_acc_30d_past_due_12_months       0
num_acc_30d_past_due_6_months        0
num_mortgage_currently_past_due      0
tot_amount_currently_past_due        0
num_inq_12_month                     0
num_card_inq_24_month                0
num_card_12_month                    0
num_auto_ 36_month                   0
uti_open_card                        0
pct_over_50_uti                      0
uti_max_credit_line                  0
pct_card_over_50_uti               489
ind_XYZ                              0
rep_income                         410
rep_education                        4
Def_ind                              0
dtype: int64
 
tot_balance                        float64
avg_bal_cards                      float64
credit_age                           int64
credit_age_good_account              int64
credit_card_age                      int64
num_acc_30d_past_due_12_months       int64
num_acc_30d_past_due_6_months        int64
num_mortgage_currently_past_due      int64
tot_amount_currently_past_due      float64
num_inq_12_month                     int64
num_card_inq_24_month                int64
num_card_12_month                    int64
num_auto_ 36_month                   int64
uti_open_card                      float64
pct_over_50_uti                    float64
uti_max_credit_line                float64
pct_card_over_50_uti               float64
ind_XYZ                              int64
rep_income                         float64
rep_education                       object
Def_ind                              int64
dtype: object
 
Number of duplicate accounts: 0
 
Size of train_df: (5000, 21)
Size of train_df2 (4133, 21)
C:\Users\Barry\AppData\Local\Temp\ipykernel_15592\4022763135.py:23: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df2['rep_education']=b;

Section 2: Logistic Regression

     Logistic regression is a supervised machine learning classifier algorithm for estimating the conditional probability, $P(Y|X)$; given matrix $X$ is a dataset of features, and $Y$ is a binomially distributed target random variable [8]. Peeling back the core of any neural network is its loss function (or cost function); the loss functon is what is used to attain a gradient (e.g. a vector derivative) for the Gradient descent algorithm which optimizes the weight matrices in the backward propagation process of the neruel network's training.
     As the loss function will be important for explaining regularization, we constructed the logistic regression function here. Suppose $Y$ is a Bernoulli distributed random variable and matrix $X$ is a collection of independent continuous or discete random variables - with no need for homoscedasticity - in otherword similar variences. In regards to the conditional probability, P(Y=y,X=x), we have

$$P(Y=y,X=x)=p^{y}(1-p)^{1-y},$$

though we do know the value of $p$, the probability of $y$ being $1$. We can remedy this by an approximation of p; suppose $\beta$ is a weight vector and a $\sigma(z)$ is the sigmoid function, then can derive the following,

$$z=\beta_{0}+\sum^{n}_{i} \beta_{i} x_{i} = \beta^{T}X \to P(Y=1,X=x)=\sigma(z)=\frac{1}{1+e^{-z}}.$$

Plugging in the approximate of $p$ into Bernoulli distribution formula we attain the estimated Bernoulli formula,

$$P(Y=y,X=x)=\sigma(\beta^{T}X)^{y}(1-\sigma(\beta^{T}X))^{1-y},$$

and by calculating Maximum likelihood estimate (MLE) of said estimated Bernoulli formulation, we are able to attain the loss function of logistic regression.

$$L(\beta)=\prod_{i}^{n} P(Y=y|X=x)=\prod_{i}^{n} \sigma(\beta^{T}X)^{y_{i}}(1-\sigma(\beta^{T}X))^{1-y_{i}}$$

$$ \to ln(L(\beta))= \ln\bigg( \prod_{i}^{n} P(Y=y|X=x)=\prod_{i}^{n} \sigma(\beta^{T}X)^{y_{i}}(1-\sigma(\beta^{T}X))^{1-y_{i}}\bigg)$$

$$ \to LL(\beta))= \sum_{i}^{n} \ln\bigg( \sigma(\beta^{T}X)^{y_{i}}(1-\sigma(\beta^{T}X))^{1-y_{i}} \bigg)$$

$$ \to LL(\beta))= \sum_{i}^{n} y_{i}\ln(\beta^{T}x_{i})+(1-y_{i})\ln(1-\beta^{T}x_{i}). \square$$

This MLE is the loss function to the logistic regression, and in essence to attained the optimum (e.g., desired) beta weight vector, we need to minimize (e.g. solve) this optimization problem. The most conventional approach is to utilize a gradient descent algorithm [8], which gradient derivative is defined by taking the partial derivative of the loss function in respect to each beta weight; $\frac{\partial LL(\beta)}{\partial \beta_{i}}$. To guarantees convergence to a global optimum by Gradient descent, matrix X must be symmetric positive definite (e.g. a convex optimization problem), with a learning rate less then $1$. This is due to the fact that the partial derivative of the loss function needs to be greater then zero but less than $1$ when sufficiently close to the minima (e.g. bottom of the valley), otherwise gradient descent will diverge or will get stuck insufficiently close to the minima - returning the wrong beta weight vector [9]. These details are very important to understand how regularization addresses overfitting in the logistic regression model; but we will further discuss regularization after the following analysis.
     We seperated the predictor features from train_df2 into a dataframe named train_X to form matrix $X$ and the target variable into dataframe named $train_Y$. Perform the same process for test_df2; we seperated the predictor features from test_df2 into test_X, and the target variable into test_Y. We then were able to implement the logistic regression function from sklearn.linear_model for our analysis.

In [10]:
list(train_df2.columns[0:20])
Out[10]:
['tot_balance',
 'avg_bal_cards',
 'credit_age',
 'credit_age_good_account',
 'credit_card_age',
 'num_acc_30d_past_due_12_months',
 'num_acc_30d_past_due_6_months',
 'num_mortgage_currently_past_due',
 'tot_amount_currently_past_due',
 'num_inq_12_month',
 'num_card_inq_24_month',
 'num_card_12_month',
 'num_auto_ 36_month',
 'uti_open_card',
 'pct_over_50_uti',
 'uti_max_credit_line',
 'pct_card_over_50_uti',
 'ind_XYZ',
 'rep_income',
 'rep_education']
In [11]:
labels=list(train_df2.columns[0:20]);
train_X=train_df2[labels]; train_Y=train_df2['Def_ind'];
test_X=test_df2[labels]; test_Y=test_df2['Def_ind'];

LR=LogisticRegression()
clf=LR.fit(train_X,train_Y)

prob_d=clf.predict_proba(test_X);
pred=clf.predict(test_X);
score=clf.score(test_X,test_Y);

print(pd.DataFrame(np.array([prob_d[:,0].T,prob_d[:,1].T,test_Y.T,pred]).T,
             columns=['prob_Def_Ind=0','prob_Def_Ind=1','Def_ind',"pred"]));

print(" ");
print("Beta_Coefficients from Logistic Regression model:");
print(pd.DataFrame(np.array(clf.coef_),
                   columns=list(train_df2.columns[0:20])).transpose());
print(" ");
print("Accuracy of the trained model was "+str(score));

cm=metrics.confusion_matrix(test_Y, pred);
plt.figure(figsize=(7,7));
seaborn.heatmap(cm,annot=True,fmt=".3f", 
            linewidths=.5,square=True,cmap='Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title='Accuracy Score:{0}'.format(score);
plt.title(all_sample_title,size=15);
print(" ")
recall=cm[0,0]/(cm[0,0]+cm[1,0]);
precision=cm[0,0]/(cm[0,0]+cm[0,1]);
accuracy=(cm[0,0]+cm[1,1])/(cm[0,0]+cm[0,1]+cm[1,0]+cm[1,1]);
F_measure=(2*recall*precision)/(recall+precision);
print("Recall: "+str(recall));
print("Precision: "+str(precision));
print("Accuracy: "+str(accuracy));
print("F-measure: "+str(F_measure));

logit_roc_auc=roc_auc_score(test_Y,clf.predict(test_X));
fpr,tpr,thresholds=roc_curve(test_Y, clf.predict_proba(test_X)[:,1]);
plt.figure();
plt.plot(fpr, tpr,label='Logistic Regression (area = %0.2f)'%logit_roc_auc);
plt.plot([0,1],[0,1],'r--');
plt.xlim([0.0,1.0]);
plt.ylim([0.0,1.05]);
plt.xlabel('False Positive Rate');
plt.ylabel('True Positive Rate');
plt.title('Receiver operating characteristic');
plt.legend(loc="lower right");
plt.savefig('Log_ROC');
plt.show();
      prob_Def_Ind=0  prob_Def_Ind=1  Def_ind  pred
0           0.863948        0.136052      0.0   0.0
1           0.881070        0.118930      0.0   0.0
2           0.058506        0.941494      0.0   1.0
3           0.950192        0.049808      0.0   0.0
4           0.899325        0.100675      0.0   0.0
...              ...             ...      ...   ...
4128        0.870301        0.129699      0.0   0.0
4129        0.912065        0.087935      0.0   0.0
4130        0.911367        0.088633      0.0   0.0
4131        0.948672        0.051328      0.0   0.0
4132        0.921004        0.078996      0.0   0.0

[4133 rows x 4 columns]
 
Beta_Coefficients from Logistic Regression model:
                                        0
tot_balance                     -0.000001
avg_bal_cards                   -0.000115
credit_age                      -0.005132
credit_age_good_account          0.000357
credit_card_age                  0.000814
num_acc_30d_past_due_12_months   0.000283
num_acc_30d_past_due_6_months    0.000061
num_mortgage_currently_past_due  0.000064
tot_amount_currently_past_due    0.000266
num_inq_12_month                 0.001132
num_card_inq_24_month            0.001739
num_card_12_month                0.000109
num_auto_ 36_month               0.000022
uti_open_card                    0.000214
pct_over_50_uti                  0.000167
uti_max_credit_line              0.000166
pct_card_over_50_uti             0.000181
ind_XYZ                         -0.000130
rep_income                       0.000001
rep_education                   -0.000064
 
Accuracy of the trained model was 0.8826518267602226
 
Recall: 0.9195342820181113
Precision: 0.9533261802575107
Accuracy: 0.8826518267602226
F-measure: 0.9361253786382194
No description has been provided for this image
No description has been provided for this image

     The trained logistic regression model (LR model) was tested on test_X and test_Y, and it was able to correctly predict roughly $88.30\%$ of accounts that would default on their credit card within the past 18 month since their credit account was opened by bank XYZ. To further ground this result, we computate the confusion matrix to calcalute recall, presicion, accuracy and F-measure of the trained LR model. The result was that recall was approximatly $92.00\%$, which means the trained LR model correctly predicted $92.00\%$ of the true positive cases. Precision was $95.00\%$, which means that from all the cases marked predicted as true, $95.00\%$ of said cases were actually true. As attained by clf.score(), the accuracy of the trained LR model was $88.30\%$; which simply means that the train LR model was able to predict correctly $88.30\%$ of the test accounts.
     As we had to rank a trained LR model against a trained random Forest model, and we decided to computate the F-measure and ROC curve [10]; the resulting F-measure was on the high end of the scale - which is desired. In regards to the ROC curve, it is desire that the precision rate curve remains far away from the forty=five degree angled line curve for as long as possible. Nevertheless the F-measure of the train LR model was measured to be $0.9361$, which means the would be considered a good model; thoough we can still improve said LR model to attain a higher accuracy.
     We believed the trained LR model was overfitted against the training dataset, train_df2, and hence required regularization to assist gradient descent to get closer to the minima of the loss function. We remind the reader that gradient descent can only reaches the minima in $O(1/\epsilon)$ steps, but prior to achieving $O(1/\epsilon)$ steps - the gradient descent is sufficiently close to the minima but not at the minima.
     Regularization of a neural network can take the form of L1-Regularization with an added L1-normalization proportioned by $frac{\lambda}{2n}$ term to the cost function, or L2-Regularization with an added L2-normalization proportioned by $frac{\lambda}{2n}$ term to the cost function, or dropout technique - which give a certain probability that a element of the weight matrix is dropped to zero. In our case we choice L2-regularization has it has a greater effect then L1-regularization at reducing variance in a train neural network (e.g., trained model). Recollect the LR loss function, to make it the cost function we need to average it, then we can add L2-regularization term to attain the new regularizad cost function,

$$Cost(\beta))= \frac{1}{n}\sum_{i}^{n} y_{i}\ln(\beta^{T}x_{i})+(1-y_{i})\ln(1-\beta^{T}x_{i})+\frac{\lambda}{2n}.$$

In simple term, L1 and L2 regularization is an additional dial, which is similar to learning rate at adjusting the gradient magnitude to conform to Theorem 6.2 from [9].

Theorem 6.2 Suppose the function $f : \R^{n} \to \R$ is convex and differentiable, and that its gradient is Lipschitz continuous with constant $L > 0$, i.e., we have that $\Vert \bigtriangledown f(x) \bigtriangledown f(y) \Vert_{2} \leq L \Vert x - y \Vert$ for any $x$,$y$. Then if we [ran the] gradient for $k$ iterations with step size $t_{i}$ chosen using backtracking line search on each iteration, it will yield a solution $f^{(k)}$ which satisfies

$$f(x^{(k)})-f(x^{*}) \leq \frac{\vert x^{0} - x^{0} \Vert_{2}^{2}}{2 t_{\text{min}}k}$$

where $t_min=\min{1,\beta/L}$.

In simple terms want to select a $\lambda$ value that adjust the gradient in such a way to get closer to the minima or in other word, attain a better approximation to the optimal beta weight vector (e.g. minima).
     To implement L2-regularization is rather simple and only requires to pass these parameters penalty="l2", solver="liblinear", tol=1e-6,max_iter=int(1e6), warm_start=True, intercept_scaling=10000.0 into the sklearn.linear_model.LogisticRegression. Lastly we performed the same model validation process as last time.

In [12]:
LR=LogisticRegression(penalty="l2",solver="liblinear",tol=1e-6,
    max_iter=int(1e6),warm_start=True,intercept_scaling=10000.0);
clf=LR.fit(train_X,train_Y);

prob_d=clf.predict_proba(test_X);
pred=clf.predict(test_X);
score=clf.score(test_X,test_Y);

print(pd.DataFrame(np.array([prob_d[:,0].T,prob_d[:,1].T,test_Y.T,pred]).T,
             columns=['prob_Def_Ind=0','prob_Def_Ind=1','Def_ind',"pred"]));

print(" ");
print("Beta_Coefficients from Reg. Logistic Regression model:");
print(pd.DataFrame(np.array(clf.coef_),
                   columns=list(train_df2.columns[0:20])).transpose());
print(" ");
print("Accuracy of the trained model was "+str(score));

cm=metrics.confusion_matrix(test_Y,pred);
plt.figure(figsize=(7,7));
seaborn.heatmap(cm, annot=True,fmt=".3f", 
            linewidths=.5,square=True,cmap='Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title='Accuracy Score:{0}'.format(score);
plt.title(all_sample_title,size=15);
print(" ");
recall=cm[0,0]/(cm[0,0]+cm[1,0]);
precision=cm[0,0]/(cm[0,0]+cm[0,1]);
accuracy=(cm[0,0]+cm[1,1])/(cm[0,0]+cm[0,1]+cm[1,0]+cm[1,1]);
F_measure=(2*recall*precision)/(recall+precision);
print("Recall: "+str(recall));
print("Precision: "+str(precision));
print("Accuracy: "+str(accuracy));
print("F-measure: "+str(F_measure));

logit_roc_auc=roc_auc_score(test_Y,clf.predict(test_X));
fpr,tpr,thresholds=roc_curve(test_Y,clf.predict_proba(test_X)[:,1]);
plt.figure();
plt.plot(fpr, tpr,label='Logistic Regression (area = %0.2f)'%logit_roc_auc);
plt.plot([0,1],[0,1],'r--');
plt.xlim([0.0,1.0]);
plt.ylim([0.0,1.05]);
plt.xlabel('False Positive Rate');
plt.ylabel('True Positive Rate');
plt.title('Receiver operating characteristic');
plt.legend(loc="lower right");
plt.savefig('Log_ROC');
plt.show();
      prob_Def_Ind=0  prob_Def_Ind=1  Def_ind  pred
0           0.933852        0.066148      0.0   0.0
1           0.761364        0.238636      0.0   0.0
2           0.712649        0.287351      0.0   0.0
3           0.959617        0.040383      0.0   0.0
4           0.935899        0.064101      0.0   0.0
...              ...             ...      ...   ...
4128        0.846793        0.153207      0.0   0.0
4129        0.935317        0.064683      0.0   0.0
4130        0.752892        0.247108      0.0   0.0
4131        0.967526        0.032474      0.0   0.0
4132        0.945871        0.054129      0.0   0.0

[4133 rows x 4 columns]
 
Beta_Coefficients from Reg. Logistic Regression model:
                                            0
tot_balance                     -1.567375e-06
avg_bal_cards                   -1.214146e-04
credit_age                      -4.323569e-03
credit_age_good_account          5.987032e-04
credit_card_age                 -4.973357e-04
num_acc_30d_past_due_12_months   9.623256e-01
num_acc_30d_past_due_6_months    2.252230e-01
num_mortgage_currently_past_due  2.290428e-01
tot_amount_currently_past_due    1.910164e-05
num_inq_12_month                 3.955467e-01
num_card_inq_24_month           -7.654231e-02
num_card_12_month                1.820608e-01
num_auto_ 36_month               3.647885e-02
uti_open_card                    8.876689e-01
pct_over_50_uti                  6.728455e-01
uti_max_credit_line              6.641946e-01
pct_card_over_50_uti             7.199934e-01
ind_XYZ                         -3.106927e-01
rep_income                       6.300817e-07
rep_education                   -9.712513e-02
 
Accuracy of the trained model was 0.9044277764335834
 
Recall: 0.92372234935164
Precision: 0.9745171673819742
Accuracy: 0.9044277764335834
F-measure: 0.9484401514162641
No description has been provided for this image
No description has been provided for this image

     The regularized trained logistic regression model (reg LR model) was tested on test_X and test_Y, and it was able to correctly predict roughly $94.00\%$ of accounts that would default on their credit card within the past 18 month since their credit account was opened by bank XYZ. To further ground this result, we computate the confusion matrix to calcalute recall, presicion, accuracy and F-measure of the trained LR model. The result was that recall was approximatly $94.00\%$, which means the trained reg LR model correctly predicted $97.50\%$ of the true positive cases. Precision was $97.50\%$, which means that from all the cases marked predicted as true, $97.50\%$ of said cases were actually true. As attained by clf.score(), the accuracy of the trained LR model was $90.40\%$; which simply means that the train reg LR model was able to predict correctly $90.40\%$ of the test accounts. This is a significant improvement over the non-regularized train LR model; which also further confirms that the non-reg LR model was slightly overfitted - though the L2 regularization was able to reduce varience in the model enough to improve the accuracy of the model.

Section 3: Random Forest

     Random Forest algorithm is a classifier based on a combination of tree predictors [11], which are PRN (pseudo-random number) sampled from indepent and identically distributed random variables as stated in Definition 1.1[11].
     To implement sklearn.ensemble.RandomForestClassifier, we had to standardize each predictor feature and target variable to a standard normal distribution; we accomplished this aim through StandardScaler() and fit_transform(). We completed this process for both train_df2 and test_df2. With the standardizse data, ss_X and ss_Y, we split the standardize dataframe into train and test predictor feature sets, and train and test target variable sets. We were then able to train and test the random forest model. Lastly we conduct the same validation process as we conduct for the LR model.

In [13]:
from sklearn.model_selection import train_test_split
In [16]:
ss_X_df.columns[0:20]
Out[16]:
Index(['tot_balance', 'avg_bal_cards', 'credit_age', 'credit_age_good_account',
       'credit_card_age', 'num_acc_30d_past_due_12_months',
       'num_acc_30d_past_due_6_months', 'num_mortgage_currently_past_due',
       'tot_amount_currently_past_due', 'num_inq_12_month',
       'num_card_inq_24_month', 'num_card_12_month', 'num_auto_ 36_month',
       'uti_open_card', 'pct_over_50_uti', 'uti_max_credit_line',
       'pct_card_over_50_uti', 'ind_XYZ', 'rep_income', 'rep_education'],
      dtype='object')
In [25]:
X_test
Out[25]:
tot_balance avg_bal_cards credit_age credit_age_good_account credit_card_age num_acc_30d_past_due_12_months num_acc_30d_past_due_6_months num_mortgage_currently_past_due tot_amount_currently_past_due num_inq_12_month ... num_card_12_month num_auto_ 36_month uti_open_card pct_over_50_uti uti_max_credit_line pct_card_over_50_uti ind_XYZ rep_income rep_education Def_ind
14056 0.667604 0.059683 0.492772 0.721140 0.382498 -0.333932 -0.167036 -0.175574 -0.197555 -0.531987 ... -0.559488 -0.436351 -0.420380 0.600560 -0.420092 -1.108982 1.738244 0.468127 0.252436 -0.336847
3466 0.010697 -0.043129 -1.580245 -2.437935 -1.838257 1.781311 -0.167036 -0.175574 1.360689 0.335455 ... -0.559488 -0.436351 1.235583 1.274821 2.250931 1.533553 1.738244 0.841237 -1.283487 -0.336847
9307 -0.200614 -0.721324 1.502004 1.031869 1.407462 -0.333932 -0.167036 -0.175574 -0.197555 1.202897 ... -0.559488 2.208918 -1.471210 -1.054665 -0.342764 -0.801214 1.738244 0.944058 0.252436 -0.336847
9338 0.758774 0.908916 -0.530098 -0.547669 -0.362930 -0.333932 -0.167036 -0.175574 -0.197555 0.335455 ... -0.559488 -0.436351 0.234652 0.128746 0.190185 0.627270 -0.575293 0.512623 -1.283487 -0.336847
18813 -0.136315 0.550532 0.656431 -0.159258 0.941570 -0.333932 -0.167036 -0.175574 -0.197555 -0.531987 ... -0.559488 -0.436351 -1.625749 -2.048498 -0.921564 -2.090365 -0.575293 0.617083 0.252436 -0.336847
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13636 2.168404 1.744654 0.015433 -0.392304 0.366969 -0.333932 -0.167036 -0.175574 -0.197555 -0.531987 ... -0.559488 -0.436351 -0.362622 0.722125 0.002852 0.079278 -0.575293 0.630887 -1.283487 -0.336847
6996 -0.391903 -0.317970 -0.584651 -0.936080 -1.310245 1.781311 -0.167036 -0.175574 -0.197555 -0.531987 ... -0.559488 -0.436351 0.632993 0.964332 0.341807 0.067612 1.738244 -0.498523 0.252436 -0.336847
5805 1.040378 1.173876 1.815685 1.420280 1.190046 1.781311 -0.167036 -0.175574 -0.197555 -0.531987 ... -0.559488 2.208918 0.736162 0.196673 1.353546 -0.066670 1.738244 -0.539789 1.788360 -0.336847
11259 0.711189 1.261833 -0.216418 -1.117338 -0.735644 3.896554 5.521415 5.695591 4.375447 -0.531987 ... -0.559488 -0.436351 -0.426881 -1.330365 -0.939289 -0.874243 -0.575293 -0.341298 1.788360 -0.336847
16995 1.064149 1.178914 -1.743905 -1.013762 -2.009084 -0.333932 -0.167036 -0.175574 -0.197555 -0.531987 ... 1.495466 -0.436351 1.693768 0.282412 1.400116 0.542642 -0.575293 2.287862 0.252436 -0.336847

4996 rows × 21 columns

In [15]:
ss_X=StandardScaler().fit_transform(train_df2);
ss_Y=StandardScaler().fit_transform(test_df2);
ss_X_df=pd.DataFrame(ss_X, columns=train_df2.columns,index=train_df2.index);
ss_Y_df=pd.DataFrame(ss_Y, columns=test_df2.columns,index=test_df2.index);
labels=list(ss_X_df.columns[0:20]);
X_train=ss_X_df[labels]; y_train=ss_X_df['Def_ind'];
X_test=ss_Y_df[labels]; y_test=ss_Y_df['Def_ind'];

X_train,X_test,y_train,y_test=train_test_split(ss_X_df[labels],train_Y,
                                               test_size=0.3,random_state =0);
classifier=RandomForestClassifier(n_estimators=100); 
print("classifer: RandomForestClassifier");

classifier=classifier.fit(X_train,y_train);
predicted=classifier.predict(X_test);
score2=classifier.score(X_test,y_test);

print(pd.concat([y_test,pd.Series(predicted,index=y_test.index,
                                    name='predicted')], axis=1));
print(" ");
print(classifier.predict_proba(X_test));
print(" ");
print("Accounts that Defaulted: "+str(sum(y_test))+"  "+
      "Accounts predcited to Defualt: "+str(sum(predicted)));
print("Accuracy: ", accuracy_score(y_test, predicted));
print(" ");
print("Decision path of Random Forest algorithm:")
print(classifier.decision_path(X_test));


clf2=DecisionTreeClassifier(max_depth=2,random_state=0);
clf2=clf2.fit(X_test,y_test);
tree.plot_tree(clf2);

cm=metrics.confusion_matrix(y_test,predicted);
plt.figure(figsize=(7,7));
seaborn.heatmap(cm, annot=True,fmt=".3f", 
            linewidths=.5,square=True,cmap='Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title='Accuracy Score:{0}'.format(score2);
plt.title(all_sample_title,size=15);
print(" ");
recall=cm[0,0]/(cm[0,0]+cm[1,0]);
precision=cm[0,0]/(cm[0,0]+cm[0,1]);
accuracy=(cm[0,0]+cm[1,1])/(cm[0,0]+cm[0,1]+cm[1,0]+cm[1,1]);
F_measure=(2*recall*precision)/(recall+precision);
print("Recall: "+str(recall));
print("Precision: "+str(precision));
print("Accuracy: "+str(accuracy));
print("F-measure: "+str(F_measure));

rf_roc_auc=roc_auc_score(y_test,classifier.predict(X_test));
fpr,tpr,thresholds=roc_curve(y_test,classifier.predict_proba(X_test)[:,1]);
plt.figure();
plt.plot(fpr, tpr,label='Random Forest (RF) (area = %0.2f)'%rf_roc_auc);
plt.plot([0,1],[0,1],'r--');
plt.xlim([0.0,1.0]);
plt.ylim([0.0,1.05]);
plt.xlabel('False Positive Rate');
plt.ylabel('True Positive Rate');
plt.title('Receiver operating characteristic');
plt.legend(loc="lower right");
plt.savefig('RF_ROC');
plt.show();
classifer: RandomForestClassifier
       Def_ind  predicted
14056        0          0
3466         0          0
9307         0          0
9338         0          0
18813        0          0
...        ...        ...
13636        0          0
6996         0          0
5805         0          0
11259        0          0
16995        0          0

[4996 rows x 2 columns]
 
[[0.99 0.01]
 [0.79 0.21]
 [0.99 0.01]
 ...
 [0.9  0.1 ]
 [0.81 0.19]
 [0.71 0.29]]
 
Accounts that Defaulted: 494  Accounts predcited to Defualt: 157
Accuracy:  0.91693354683747
 
Decision path of Random Forest algorithm:
(<4996x181700 sparse matrix of type '<class 'numpy.int64'>'
	with 8461917 stored elements in Compressed Sparse Row format>, array([     0,   1921,   3840,   5673,   7506,   9389,  11196,  13059,
        14886,  16679,  18436,  20185,  22054,  23889,  25740,  27505,
        29396,  31155,  32944,  34729,  36616,  38507,  40390,  42221,
        44098,  45851,  47666,  49493,  51286,  53165,  55000,  56787,
        58572,  60451,  62210,  63971,  65786,  67625,  69462,  71229,
        73070,  74913,  76666,  78401,  80246,  82195,  83992,  85849,
        87636,  89501,  91300,  93111,  94850,  96639,  98436, 100215,
       102050, 103787, 105560, 107389, 109192, 110943, 112826, 114637,
       116490, 118303, 120176, 121949, 123756, 125605, 127432, 129271,
       131120, 132901, 134724, 136481, 138288, 140051, 141878, 143667,
       145552, 147277, 149090, 150903, 152722, 154511, 156354, 158231,
       160060, 161827, 163626, 165457, 167234, 168985, 170840, 172631,
       174488, 176281, 178072, 179941, 181700], dtype=int32))
 
Recall: 0.9222979954536061
Precision: 0.9913371834740116
Accuracy: 0.91693354683747
F-measure: 0.9555722085429824
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

     Overall predictive capacity of the random forest algorithm was slightly better than the regularized logistic regression. The slight improvement in the overal accuracy of the random forest algorithm resulted in a small increased in the F-measure, but a completely negligible increase in the ROC curve and thus the area captured underneath the curve.
     The way to regularize random forest is the pruning of the decision tree, where branch decisions are cut from the decision tree. But in our case the decision tree lacks depth, and hence cannot be prune - as all decision branches are vital to the decision tree. In effect there is no means to regularize our model, as all branches are vital and cannot be pruned to reduce any possible overfitting in the model.

Section 4: Compare Results and Select Model

     From our analysis we observed that for this collection of data, the logistic regression classifier was slightly more prone to overfitting the training set, than the random forest classifier. Though the overfitting was somewhat mitigated with L2-regularization of the logistic regression, which improved the ovarall performance of the logistic regression classifier.
     Observing the F-measure and ROC graph, it is evident that either algorithm would be suited as a model to predict the likelihood of a future defualt by a credit card applicant. As the generality in predictive capacity of the random forest classifier seems to be significantly greater then the logistic regression for this dataset and problem, but it is only slightly greater than the former algorithm. In consequence we would select random forest as the model to review credit card applications, because it possessed a better F-measure.

Section 5: How Model Improves Decision Making?

     As no model is hundred percent correct hundred percent of the time; it is important to validate the prediction result's of the random forest model. In essence the applicants data would be formatted into a readable csv file, imported into a python envorinment and feed into the model, so to return its prediction results, which would be added to the applicant dataset.
     In a real-world application, we would not possess a training set to compare the results of the random forest model, we would have to rely on correlative relationships between the predictor features and the target variable, Def_ind. We remind the reader that in Section 1, through the correlation matrix, we were able to discern that the following features possessed a weak positive relationship with the target variable, Def_ind:

num_acc_30d_past_due_12_months, num_acc_30d_past_due_6_months, num_mortgage_currently_past_due, tot_amount_currently_past_due and num_card_12_month.

With which that the applicant dataset could be reduce to the accounts that the random forest model predicts would default on their credit card; let's label it the defualt dataset. One then would discern if any of these account possessed one or more of the weak positive relation features, and if they did - it would validate the model prediction; as possessing one of these predictor feature increases the probability of defualt.

References:

The enternal links are also references. Just click the linked numbers to be directed to the referenced source.

  • [1]: Stackoverflow: How do I set custom CSS for my IPython/IHaskell/Jupyter Notebook?, https://stackoverflow.com/questions/32156248/how-do-i-set-custom-css-for-my-ipython-ihaskell-jupyter-notebook
  • Wan, Shuyan, et al. “Model Selection for Credit Card Approval - Ohio State University.” Www.asc.ohio-State.edu, Ohio State University, https://www.asc.ohio-state.edu/goel.1/STATLEARN/PROJECTS/Presentations/CreditCardApproval.pdf.